In response to the rapid growth of many sorts of information, highway data
has continued to evolve in the direction of big data in terms of scale, type,
and structure, exhibiting characteristics of multi-source heterogeneous data.
The k-nearest neighbor (KNN) join has received a lot of interest in recent years
due to its wide range of applications. Processing KNN joins is time-consuming
because the cost of the join grows quadratically with the input size, and the
problem becomes more demanding as more applications deal with vast amounts
of data. The authors seek to save computing resources by leveraging a large
number of threads and multiprocessors. Six popular datasets are used to apply
the method and to compare the sequential and parallel performance of the KNN
technique. Compared to a matching multi-core solution, the final implementation
saves computing resources. It has been optimized to use as little RAM as
possible, allowing it to manage high-resolution photo data without
sacrificing efficiency. The presented technique is implemented using Spark
Radoop. Our performance study validates the efficacy and scalability of the
proposed method.

1. INTRODUCTION


Large amounts of data can be difficult to comprehend and even to analyze [1], [2]. This
challenge can be addressed with data mining techniques, which derive information from the
collected data. The twenty-first century is an era [2] in which there is an enormous quantity of data, and that
quantity is growing at an alarming rate. When dealing with such large amounts of data, the usual performance
of computers is insufficient, and more intelligence is required to keep up with the current expansion of big
data [3]-[17]. High-performance computing (HPC) should be taken into account when designing and
creating software, since it has the potential to increase computation speed while also giving more accurate
results at a lower total cost than traditional computing [18]-[25]. HPC may be classified into a variety of
categories, the most prominent of which are computer clusters, grid computing, cloud computing, graphical
processing units (GPU), microprocessors, and field programmable gate arrays (FPGA) [4], [20]-[22], [24].
One way to obtain high performance at the GPU level is distributed memory programming with several
processors.

High-performance computing can be used to deal with large amounts of data in a variety of ways, one of
which is parallelizing machine learning algorithms. Because it classifies data simply and effectively,
the k-nearest neighbors (KNN) algorithm [26]-[41] is one of the most extensively used machine learning
methods. KNN is a classification strategy that assigns an item to a category based on its proximity to the
nearest points in the training set. Compared to other algorithms, it is considered straightforward, capable of
dealing with noisy datasets, and capable of producing decent results. Its disadvantage, on the other hand, is
that it is computationally expensive, so the cost of putting it into practice is significant. A parallel
formulation of the method reduces the computational burden as well as the time required to perform the
distance calculation for each point in the dataset. The purpose of this article is therefore to accelerate the
KNN method through the use of multiple threads and multiple processors, thereby reducing the time required
to run the algorithm.

The KNN join and its variations have lately received a great deal of interest in applications such as outlier
identification, classification, molecular motion analysis, geographic databases, and pattern recognition.
It has also been stated that the performance of the popular k-means or k-medoid clustering algorithms may
be greatly enhanced by incorporating KNN joins into the process. The KNN join operates on two relations,
R and S: for every point in R, it returns the k nearest points in S. We consider the difficulty posed by
determining the k nearest neighbors of a query point in a high-dimensional dataset. To handle this problem
efficiently, we aim to improve the performance of an existing method by parallelizing it and by making it
more tolerant to stragglers. While the KNN problem is not new, it is frequently employed as a first step in
a wide range of real-world applications, including genomics, customized search, network security, and
web-based recommendation systems. As data volumes and dimensionalities expand in the age of big data,
KNN algorithms are frequently found to be a bottleneck. There is a large body of research on fast nearest
neighbor retrieval [32]-[41].
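
To make the definition of the KNN join concrete, the following is a minimal brute-force sketch in Python. The function name `knn_join`, the Euclidean metric, and the sample points are assumptions made for illustration and are not taken from the cited works. The sketch computes |R| x |S| distances, which is the quadratic cost that motivates parallelization.

```python
import heapq
import math


def knn_join(R, S, k):
    """Brute-force KNN join: for every point r in R, return its k nearest points in S."""
    result = {}
    for i, r in enumerate(R):
        # math.dist computes the Euclidean distance between two points (an assumed metric).
        result[i] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result


# Hypothetical two-dimensional relations.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]
joined = knn_join(R, S, k=2)
```

Each point of R is handled independently of the others, so the outer loop is the natural place to split the work across threads or nodes.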

A parallel KNN technique based on MapReduce on a heterogeneous cluster was presented in [42], using
the block nested loop strategy for KNN joins. Data is partitioned into equal-sized blocks during the map
stage, and these blocks are then subdivided into buckets during the sort step. During the reduce phase, the
reducer performs a block nested loop KNN join for each bucket, and the results are saved as distributed file
system (DFS) files. Analysis of the findings revealed that the partitions were balanced with respect to
running time and that speedup increased with the size of the cluster. Communication using the Hadoop block
R-tree join, on the other hand, was found to be twice as effective as communication using the Hadoop
Z-value KNN join. Furthermore, the researchers found that recall and precision decrease as the dimension
increases, which makes above-average results harder to reach. A second study, by Sismanis et al. [43],
published in the Journal of Scientific Computing in 2012, applied the KNN algorithm to higher dimensions
on the multi-core processors of a graphics processing unit (GPU). The sorting step of the KNN algorithm is
parallelized using the truncated bitonic sort (TBiS). The study concluded that the GPU with TBiS
outperforms the sort and pick procedures.

Rajani et al. [44] investigated parallel k-nearest neighbor execution using tree-based data structures
in open multi-processing (OpenMP) and the Galois framework. The OpenMP implementation of KNN was
run with four thread counts: 1, 4, 8, and 16 threads. Amdahl's law was used to assess how much faster the
implementation was compared to the original. On the Modified National Institute of Standards and
Technology (MNIST) dataset, the researchers observed that ball trees beat k-d trees in higher dimensions.
They also found that ball trees are more efficient than the Scikit-learn implementation in Python.
Processing time falls as the datasets shrink, with a linear speedup as the dataset size decreases.
The authors of [45]-[48] produced an OpenCL-based implementation of the KNN method for use on FPGA
and GPU systems, which was published in the journal Scientific Reports. To perform distance calculation
and distance ranking in parallel, they ran a bubble sort after the data had been sent from the CPU to the
FPGA. First, a parallel distance calculation estimates the distance to each item, and then a sorting
operation finds the items with the k smallest distances. According to the findings of the study, the GPU
surpasses the CPU in calculation speed. On average, however, the FPGA version surpasses the GPU
implementation in terms of joules. KNN performance depends on k and the proximity measure. If k is too
low, the test sample will be affected by noisy points and the model may overfit. Alternatively, if k is too
high, the neighborhood may not adequately represent the class. In this study, we evaluated k values of 1,
10, and 100 (fine, medium, and coarse neighborhoods) to see whether the model's accuracy was influenced.
We also used several distance metrics to evaluate their overall efficacy. When employing proximity-based
classifiers, the size and range of the data values must be addressed, so we used z-score normalization to
reduce problems with proximity estimates.

The purpose of this paper is to outline the particulars and requirements that must be met in order
to save computing resources by leveraging a large number of threads and multiprocessors. Six popular
datasets are used to apply the method and to compare the sequential and parallel performance of the KNN
technique. The paper is organized as follows: section 2 reviews the related works regarding the
recommended method, and section 3 provides a brief discussion of the suggested methodology. Section 4
analyzes the evaluation results from the experiments conducted in this study, and section 5 concludes
the paper.


2. RELATED WORKS 
2.1. Apache Spark Radoop 

RapidMiner Radoop expands the typical RapidMiner in-memory capabilities by offering advanced
operators that are built for execution in the Hadoop data processing system [47], [49]-[53]. Data
transformations and sophisticated predictive modeling are supported by more than 60 operators that
run on a distributed Hadoop cluster. RapidMiner Radoop uses RapidMiner Studio's visual workflow
designer to simplify the creation, execution, and maintenance of predictive analytics in Hadoop.
The code-free environment and built-in intelligence reduce the complexity of Hadoop, allowing users to
concentrate on solving business challenges rather than on technical obstacles. Radoop takes care of the
workflow execution so that the user does not have to. All calculations are done in the Hadoop cluster,
where the data lives, which makes predictive analytics effective and highly scalable, even for terabytes
and petabytes of data [6], [54]-[75].

Radoop is an add-on developed to guarantee that Hadoop is fully integrated with
RapidMiner [76]. As a data science software platform, Radoop reduces the difficulties associated with data
preparation and machine learning on Hadoop and Spark, as shown in Figure 1. Radoop provides
additional operators for RapidMiner and communicates with the Hadoop cluster to execute tasks.
All operations and data processing in RapidMiner Studio run in parallel through SparkRM, which is part
of the Hadoop ecosystem. Apache Spark serves as the job execution engine, which allows use cases to be
extended and more powerful algorithms to be developed than with MLlib alone. Several Hive and Mahout
data analytics routines were reused in this study because some of their data analytics capabilities are
highly optimized. This research developed an extension that achieves tight integration while providing
the same features on Hadoop that are typically used in memory-based RapidMiner operations. First, the
RadoopNest meta-operator is added to the process; it holds the general cluster settings, such as the IP
address of the Hadoop master node. The remaining Radoop operators can only be used within this
meta-operator and not outside of it [47], [51]-[53], [77].

Figure 1. Spark Radoop architecture

2.2. MapReduce

MapReduce is a programming model created by Google to handle large-scale data analysis.
Hadoop is a widely used framework that implements it [11], [58], [64]-[67], [69]-[71], [74], [78]-[81],
and it is employed in a wide variety of applications. Data mapping and reduction are accomplished through
the MapReduce framework [82]. The map functions split the work and distribute it across the nodes of the
cluster. They process a collection of key/value pairs in order to generate a set of intermediate key/value
pairs. The reduce functions then aggregate all of the intermediate values that share a common intermediate
key to produce the results. Apache Hadoop is a freely available, Java-based framework that implements
MapReduce [83]. It supports the processing of large datasets in a distributed computing environment, as
well as the storage of data in a distributed file system (HDFS) that is both fault-tolerant and scalable.
Hadoop breaks a job down into smaller map and reduce tasks and then combines their outputs to complete
the job. The input data is divided into segments of a predetermined size, which are referred to as input
splits [84]. Each map task is executed in a separate thread and deals with a logical partition of the data
that is stored on the HDFS [6], [59], [60], [64], [65], [69], [72], [85].
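
The map, shuffle, and reduce stages described above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop's API: the function names and the example job (counting records per label) are hypothetical.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect intermediate (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Aggregate all values that share a key into one result per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical job: count how many records carry each label.
records = [("a", 1.2), ("b", 0.4), ("a", 3.1)]
pairs = map_phase(records, lambda rec: [(rec[0], 1)])
counts = reduce_phase(shuffle_phase(pairs), lambda key, values: sum(values))
```

In a real cluster the map calls run on different input splits and the shuffle moves data between nodes; here all three stages run in one process so that the flow of key/value pairs is easy to follow.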


3. METHOD 

KNN [1], [2], [4], [26], [27], [32], [37]-[42], [44], [50], [54]-[56], [86]-[88] is a non-parametric
pattern recognition approach that is used for classification and regression. In our case, the algorithm is
used for classification, with the outcome being a class membership [89]. An object is classified by a
majority vote of its neighbors: it is assigned to the class that is most common among its k nearest
neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of
its single nearest neighbor. Because it memorizes the whole dataset rather than building a discriminative
model, as many previously studied classifiers do (e.g., neural networks and decision trees), KNN is
considered a lazy learning classifier. KNN produces nonlinear decision boundaries and assigns each data
point to one of the regions they delimit [90]. The fundamental concept of KNN is that if a test sample,
which is an n-dimensional vector, is similar to a training sample, then it belongs to that training sample's
class [91]. The distance between the test sample and every sample in the training set is calculated, and the
test sample is assigned to the class that is most common among its k nearest samples, where k is an integer
that determines the size of the neighborhood around the test sample.
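
The majority-vote rule described above can be sketched as follows. This is a minimal sequential illustration, assuming Euclidean distance; the function name `knn_classify` and the toy training set are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Assign `query` the majority class among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort the training samples by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # most_common(1) returns the label that received the most votes.
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B"),
         ((6.0, 5.5), "B"), ((5.5, 6.5), "B")]
label_k1 = knn_classify(train, (1.2, 1.1), k=1)
label_k3 = knn_classify(train, (5.2, 5.1), k=3)
```

There is no training step: all of the work happens at query time, which is why the distance computation dominates the running cost.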

The performance of KNN is influenced by the value of k and by the proximity measure that is used.
If k is set too low, the test sample will be influenced by noisy points and the model may overfit.
On the other hand, if k is excessively large, the neighborhood may not accurately represent the true class.
In this study, we examined k values of 1, 10, and 100 (i.e., fine, medium, and coarse neighborhoods) to
investigate whether the accuracy of the model was affected by k. In addition, we employed a variety of
distance measures to evaluate their overall effectiveness. An essential difficulty with proximity-based
classifiers is that they are sensitive to the dimensionality and range of the data values.
We therefore employed z-score normalization to reduce difficulties with proximity estimations.
Figure 2 shows the KNN algorithm classifying a new sample.

Figure 2. The KNN algorithm classifying a new sample
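
The effect of z-score normalization on distance-based comparisons can be illustrated with a short sketch. The helper `zscore_columns` and the sample rows are hypothetical, and the population standard deviation is an assumed choice; the source does not state which variant was used.

```python
import math
import statistics


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s if s > 0 else 0.0
              for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical data: the second feature has a far larger range than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 5000.0)]
scaled = zscore_columns(rows)

raw_gap = math.dist(rows[0], rows[1])         # dominated by the second feature
scaled_gap = math.dist(scaled[0], scaled[1])  # both features contribute equally
```

Without scaling, the feature with the widest range decides which neighbors are "nearest"; after scaling, each feature carries comparable weight in the distance.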
To predict the class of a new sample point, KNN uses a database in which the data points have already
been divided into classes. The algorithm has no explicit training step; it is based on feature similarity.
For each new sample point, KNN selects the k points in the labeled dataset that most closely resemble it
and assigns the class label that appears most frequently among those k points. The k nearest neighbors are
determined by computing the distance between the test point and the training data points. As with K-means,
the distance computations account for the majority of the calculations. Unlike K-means, however, KNN does
not need iterations to reach a result: once the k nearest neighbors are found, a conclusion can be made.

In this section, various aspects of the implementation of the algorithm with the Radoop framework
are discussed in further detail. The parallelization process involves several phases. To execute the
experiments, we employed three supercomputers, including one for testing and tuning the conditions under
which the experiments were ultimately carried out and one to run them on a virtual machine cluster.
Because we are working with enormous amounts of data, a dataset may contain a significant number of
transactions. The suggested approach is implemented as a series of "map and reduce" operations.
Although WordCount is the example used in the majority of MapReduce tutorials, here the algorithm
demonstrates the parallel implementation of K-means in MapReduce for K = 2 and two features.
(Cx1, Cy1) and (Cx2, Cy2) are the feature values of the centroids of the two clusters. During the map
phase, the output for each data point (x, y) is a key-value pair whose key is either 1 or 2 and refers to
the cluster whose centroid is closest to the data point. The key and value are computed using (1).


\[
\text{Key} = \underset{1 \le i \le 2}{\arg\min} \left( |x - C_{xi}|^2 + |y - C_{yi}|^2 \right), \qquad \text{Value} = (x, y) \tag{1}
\]


The next step shuffles and sorts the key-value pairs. Pairs with the same key are routed to the same
reduce function, which computes the sum of the values of both features (x and y) over all of the data
points in the cluster identified by the key. The sums are then divided by the number of data points in that
cluster to obtain the new centroid values.
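
One iteration of the K-means map and reduce steps described above can be sketched as follows. The sketch assumes that the key is the 1-based index of the centroid with the smallest squared Euclidean distance; the function names and sample points are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Emit (key, value): key is the 1-based index of the nearest centroid, value is the point."""
    x, y = point
    distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
    key = distances.index(min(distances)) + 1
    return key, point


def kmeans_reduce(points):
    """Average the points of one cluster to obtain its new centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:                       # map phase
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)              # shuffle: group the values by key
    return {key: kmeans_reduce(vals) for key, vals in groups.items()}  # reduce phase


# Hypothetical data with K = 2 and two features.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Repeating the iteration with `new_centroids` as input continues until the centroids stop moving, which is the iterative behavior that KNN itself does not require.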


Proposed Parallel K-Nearest Neighbor
Input: S (dense array)    Output: T (reduced array)

Begin
    Run the Spark context (slave)
    Listen for the master connection
    Receive a dense array of data
    Check the length of the columns, M
    Parallelize the data for processing
    Gather N rows of data
    do in parallel
        Set the initial synaptic weights w_ij and thresholds θ_j to small random values, e.g., in [0, 1];
            assign small positive values to the learning rate parameter and the forgetting factor
        for each r in R_i do
            Initialize KNN(r) with KNN(r, S_i, K)
            θ_k(r, S) = max_{o ∈ KNN(r, S_i, K)} dist(r, o)
        Calculate the output of the neuron at iteration T
        Update the weights in the network: w_ij(p + 1) = w_ij(p) + Δw_ij(p), i, j = 1, 2, ..., n
    Send T as the reduced array
    Close the connection
End
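
The data-parallel part of the procedure above (partition the data, search each partition, merge the partial results) can be sketched in plain Python. This is not the authors' Spark Radoop implementation: the function names are hypothetical, a thread pool stands in for the cluster's workers, and in CPython threads illustrate the structure rather than deliver a real speedup.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def local_knn(partition, query, k):
    """Return the k nearest (distance, point) pairs within one partition."""
    return heapq.nsmallest(k, ((math.dist(p, query), p) for p in partition))


def parallel_knn(data, query, k, n_partitions=4):
    """Split `data` into partitions, search each one concurrently, then merge."""
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        local_results = pool.map(lambda part: local_knn(part, query, k), partitions)
        candidates = [pair for result in local_results for pair in result]
    # The global k nearest neighbors are always among the union of the local ones.
    return [point for _, point in heapq.nsmallest(k, candidates)]


# Hypothetical dataset of 100 points on a diagonal.
data = [(float(i), float(i)) for i in range(100)]
neighbors = parallel_knn(data, query=(10.2, 10.2), k=3)
```

The merge step only has to examine `k * n_partitions` candidates, so the result sent back to the master stays small regardless of the size of the input.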


We focused on supervised classification algorithms such as Naive Bayes, neural
networks, k-nearest neighbors, and random forest. A summary of the datasets used in our experiments is
shown in Table 1. The parallel generalized Hebbian algorithm (GHA) and parallel principal component
analysis (PCA) were run on the identical hardware arrangement, which is presented in Table 2. Six large
datasets, available from the University of California, Irvine machine learning repository, were used to
evaluate and compare the performance of the Apache Spark MLlib 2.0 package. The experimental
architecture discussed in this article consisted of a single Spark cluster that was developed in
Java, with Apache Zeppelin 0.7.1 as the editor and HDFS as the storage system. The Spark cluster is
made up of a master node, which executes the driver application, and three worker nodes, which perform
the data processing (one of the worker nodes runs on the master node). All nodes use the same software
package and have the same hardware setup (Intel® Core™ i7-6700 CPU running at 3.40 GHz, 16 GB of RAM,
8 logical cores, and the Windows 10 operating system); see Table 2.

The three worker nodes were allotted a total of 48 GB of RAM. On each worker node, four
executors were deployed, each with 4 GB of RAM and two processor cores. On the worker running on the
master node, three executors were deployed, each with 5 GB of RAM and two cores. A total of 16 GB of RAM
was made available to the driver process. MLlib was executed using the Scala 2.11.8 programming language
on a Spark 2.2.1 cluster, with Hadoop 2.7.3 providing distributed storage for the data and the results.
Execution time was minimized by tuning the amount of RAM available to the executors on each worker node
while keeping as many data partitions as the architecture could handle. Several large classification
datasets from the University of California, Irvine data repository were used in this work. Table 1 presents
their important properties: the number of records, attributes, and classes of each dataset.


Table 1. Datasets description 


Data             No. of records   No. of attributes   No. of classes
Covtype          581,012          54                  7
Covtype-2        581,012          54                  2
Higgs            11,000,000       28                  2
Botnet Attacks   7,062,606        115                 10
Dota2            102,944          116                 2
SUSY             5,000,000        18                  2


Table 2. System description 


Parameter                     Description
Operating system              Windows 10
CPU                           Intel® Core™ i7-6700 CPU @ 3.40 GHz with 8 logical cores
Memory                        16 GB
Number of workers             3
Computational framework       Apache Spark 2.2.1
Compatible framework          Radoop
Distributed storage system    HDFS (Hadoop 2.7.3)
Editor for code development   Apache Zeppelin 0.7.1
Language used for coding      Scala 2.11.8


Table 3. Confusion matrices description

Matrices          Predicted positive    Predicted negative
Actual positive   True positive (TP)    False negative (FN)
Actual negative   False positive (FP)   True negative (TN)


4. RESULTS AND DISCUSSION 

This section presents and discusses the results of the experiments conducted with the improved
parallel k-nearest neighbor (IPKNN) algorithm described in the methodology. The effectiveness of parallel
KNN is evaluated for accuracy and speed in terms of data-parallel and model-parallel KNN. Table 3
describes the confusion matrix.


4.1. Accuracy 

Using the entries of the confusion matrix in Table 3, the accuracy of a test is computed as
(TP + TN)/(TP + FP + FN + TN). This is the number of true positive and true negative instances as a
fraction of the total number of cases. Figure 3 shows that the improved parallel KNN algorithm is more
accurate than its predecessor.


4.2. Recall 

The recall (also known as sensitivity) is the ratio of the number of true positives to the total number
of actual positives. In mathematical terms, the recall formula is TP/(TP + FN). This measure shows how
well the model identifies the genuinely positive outcomes. Figure 4 shows the recall of the improved
parallel KNN algorithm compared with the original algorithm.

Figure 3. The accuracy of improved parallel KNN algorithm compared with original algorithm

Figure 4. The recall of improved parallel KNN algorithm compared with original algorithm
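
The accuracy and recall formulas above translate directly into code. The confusion-matrix counts in this sketch are hypothetical and are not results from the experiments.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of the actual positives that the model identified."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45
acc = accuracy(tp, fp, fn, tn)   # (40 + 45) / 100
rec = recall(tp, fn)             # 40 / 45
```

Accuracy uses all four cells of the confusion matrix, whereas recall ignores the negatives entirely, so the two can diverge sharply on imbalanced datasets.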


Because we use KNN to build the model, we can obtain the highest possible level of accuracy when
forecasting absences. KNN has the highest success rate when the aim is to make the availability forecast
as accurate as possible. In this case, KNN may use similarity-based learning approaches to model the
not-so-easy-to-access quantity, agent predictability. Once the system has found trends in the similarities
of an employee's absences, it can predict when the individual will next call in absent. This forecast
has significant restrictions, but it is generally accurate. When an employee calls in sick, KNN uses the
information to construct future instances for that employee. With this strategy, the call center will have
enough data to construct a dataset for estimating each employee's agent predictability. We compare our
approach to the most fundamental generic techniques. These comparisons show the strength of our technique,
which is more accurate than all of the classic approaches considered and competitive with other powerful
state-of-the-art methodologies. We apply our strategy in the same way as before.


4.3. Execution time 

The default calculation method is serial calculation, in which each calculation pass is scheduled
on a single processor. If the calculations are triggered by a calculation script, they run sequentially in
the order in which they occur in the script. Parallel computing divides each calculation pass into a number
of smaller sub-tasks. All sub-tasks that can run independently of one another are scheduled to execute
simultaneously on up to 64 or 128 threads, depending on the configuration. The speedup of the parallel run
over the serial run is given in (2). Figure 5 shows the execution time comparison of serial KNN vs IPKNN.


\[
S_{\text{latency}} = \frac{L_{\text{serial}}}{L_{\text{parallel}}} \tag{2}
\]

Across all data and model sets, the IPKNN implementation outperformed the single-node KNN
implementation in all of our experiments. Figure 5 shows that data-parallel KNN is significantly faster than
single-node KNN. The number of cores influences how fast the computation runs. This experiment was carried
out on Internet cluster 1. Its main purpose was to better understand the algorithm by determining how it
behaves as the number of processing cores on the machines increases. A few aspects must be considered
when assessing the data. The dataset is divided into exactly as many pieces as there are cores in the
cluster, giving n pieces overall. This configuration ensures that every available core is always working on
one or more pieces of the dataset. It lets us fully utilize the Spark configuration, keeping in mind the
recommendation that a dataset be broken into at least as many chunks as there are cores in the cluster.
For the two-core experiment, the slave node could have only one executor, and each executor could use only
15 GB of RAM.
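
The partitioning rule described above, one piece per core, can be sketched with a small helper. The function `split_into_chunks` and the core count of 4 are hypothetical; in Spark the equivalent decision is made by choosing the number of partitions.

```python
def split_into_chunks(items, n_chunks):
    """Split `items` into n_chunks contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical cluster with 4 cores in total and a 10-record dataset.
total_cores = 4
dataset = list(range(10))
chunks = split_into_chunks(dataset, total_cores)
sizes = [len(chunk) for chunk in chunks]
```

Keeping the chunk sizes within one record of each other means that no core finishes much earlier than the others, which limits the impact of stragglers.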


Figure 5. Execution time comparison of serial KNN vs IPKNN


5. CONCLUSION 

The need to develop new methods for processing huge datasets at reduced time cost prompted the
creation of a data categorization strategy based on parallel partitions. Despite the increased number of
clusters, separating data into batches helps decrease the classification algorithm's response time and the
associated processing costs. The parallel implementation of the proposed technique on a large number of
nodes was found to be approximately twice as fast. This study compared the performance of the suggested
technique on six datasets. The technique can accomplish massive data categorization in a short period of
time and may be applied to big data analysis in the medical, industrial, and business sectors, among others.
Some of Python's parallel programming constraints are new, as are some of the limits of other programming
languages. Compared to other programming languages, Python offers fewer references to learn from and fewer
applications to build on. Experiments for the technique were also conducted on a single computer, which
resulted in a number of system failures and difficulties with bounding the running time of the recommended
algorithm. Future work could look into running the parallel algorithm in the cloud, as well as running KNN
in parallel on larger datasets. In the future, the problem could be addressed by altering the number of
centroids in each batch as the batch size changes. Furthermore, determining how to identify appropriate
starting centroids, which is still a significant challenge in classification, will help increase the
accuracy of the parallel batch classification method, which is currently limited.